New cardinality estimation algorithms for HyperLogLog sketches
Abstract
This paper presents new methods to estimate the cardinalities of multisets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based on the maximum likelihood principle, a second unbiased method is derived, together with a robust and efficient numerical algorithm to calculate the estimate. The maximum likelihood approach can also be applied to more than a single HyperLogLog sketch. In particular, it is shown that it gives more precise cardinality estimates for the union, intersection, or relative complements of two sets that are both represented by HyperLogLog sketches, compared to the conventional technique using the inclusion-exclusion principle. All the new methods are demonstrated and verified by extensive simulations.
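For orientation, the conventional inclusion-exclusion technique that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration, not the paper's method: the function names, the SHA-256 based 64-bit hash, and the default precision `p=12` are assumptions made here, and `raw_estimate` is the original uncorrected HyperLogLog estimator, which is biased for small and very large cardinalities.

```python
import hashlib


def make_sketch(items, p=12):
    """Build m = 2**p HyperLogLog registers (hypothetical helper, not from the paper)."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        # Assumed hash: the first 64 bits of SHA-256 of the item's string form.
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                      # first p bits select the register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs


def raw_estimate(regs):
    """Original raw estimator, without small- or large-range corrections."""
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)  # standard constant, valid for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in regs)


def union(a, b):
    """The sketch of a union is the register-wise maximum of the two sketches."""
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return raw_estimate(a) + raw_estimate(b) - raw_estimate(union(a, b))


if __name__ == "__main__":
    a = make_sketch(range(0, 100_000))
    b = make_sketch(range(50_000, 150_000))
    print("union estimate:", round(raw_estimate(union(a, b))))
    print("intersection estimate:", round(intersection_estimate(a, b)))
```

Because the intersection estimate is a difference of three noisy estimates, its relative error grows as the intersection becomes small compared with the union, which is the weakness the abstract says the joint maximum likelihood approach improves on.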
Similar resources
New Cardinality Estimation Methods for HyperLogLog Sketches
This work presents new cardinality estimation methods for data sets recorded by HyperLogLog sketches. A simple derivation of the original estimator was found, that also gives insight how to correct its deficiencies. The result is an improved estimator that is unbiased over the full cardinality range, is easy computable, and does not rely on empirically determined data as previous approaches. Based...
Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm
We describe a new cardinality estimation algorithm that is extremely space-efficient. It applies one of three novel estimators to the compressed state of the Flajolet-Martin-85 coupon collection process. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, the new algorithm simultaneously wins on all three dimensions of the time/space/accuracy tradeoff. Our proto...
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the...
LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting
The information presented in this paper defines LogLog-Beta (LogLog-β). LogLog-β is a new algorithm for estimating cardinalities based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities, therefore, it is more efficient and simpler to implement. Our simulations show that the accuracy provided by the new al...
Coherent random permutations with record statistics
A two-parameter family of random permutations of [n] is introduced, with distribution conditionally uniform given the counts of upper and lower records. The family interpolates between two versions of Ewens' distribution. A distinguished role of the family is determined by the fact that every sequence of coherent p...
Journal:
- CoRR
Volume: abs/1702.01284
Publication date: 2017